One Microphone Source Separation
Abstract
Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering, which recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording, and argue for the application of statistical algorithms to learning this masking function. I present results of a simple factorial HMM system which is trained on recordings of single speakers and can then separate mixtures using only one observation signal, by computing the masking function and then refiltering.

1 Learning from data in computational auditory scene analysis

Imagine listening to many pianos being played simultaneously. If each pianist were striking keys randomly, it would be very difficult to tell which note came from which piano. But if each were playing a coherent song, separation would be much easier because of the structure of music. Now imagine teaching a computer to do the separation by showing it many musical scores as "training data".

Typical auditory perceptual input contains a mixture of sounds from different sources, altered by the acoustic environment. Any biological or artificial hearing system must extract individual acoustic objects or streams in order to do successful localization, denoising and recognition. Bregman [1] called this process auditory scene analysis, in analogy to vision. Source separation, or computational auditory scene analysis (CASA), is the practical realization of this problem via computer analysis of microphone recordings, and is very similar to the musical task described above. It has been investigated by research groups with different emphases. The CASA community has focused on both multiple- and single-microphone source separation problems under highly realistic acoustic conditions, but has relied almost exclusively on hand-designed systems which incorporate substantial knowledge of the human auditory system and its psychophysical characteristics (e.g. [2,3]). Unfortunately, it is difficult to incorporate large amounts of detailed statistical knowledge about the problem into such an approach. On the other hand, machine learning researchers, especially those working on independent components analysis (ICA) and related algorithms, have focused on the case of multiple microphones in simplified mixing environments and have used powerful "blind" statistical techniques. These "unmixing" algorithms (even those which attempt to recover more sources than signals) cannot operate on single recordings. Furthermore, since they often depend only on the joint amplitude histogram of the observations, they can be very sensitive to the details of filtering and reverberation in the environment.

The goal of this paper is to bring together the robust representations of CASA and methods which learn from data, in order to solve a restricted version of the source separation problem: isolating acoustic objects from only a single microphone recording.
2 Refiltering vs. unmixing

Unmixing algorithms reweight multiple simultaneous recordings $m_k(t)$ (generically called microphones) to form a new source object $s(t)$:

$$\underbrace{s(t)}_{\text{estimated source}} = \alpha_1 \underbrace{m_1(t)}_{\text{mic 1}} + \alpha_2 \underbrace{m_2(t)}_{\text{mic 2}} + \cdots + \alpha_K \underbrace{m_K(t)}_{\text{mic }K} \qquad (1)$$

The unmixing coefficients $\alpha_i$ are constant over time and are chosen to optimize some property of the set of recovered sources, which often translates into a kurtosis measure on the joint amplitude histogram of the microphones. The intuition is that unmixing algorithms are finding spikes (or dents, for low-kurtosis sources) in the marginal amplitude histogram; the time ordering of the datapoints is often irrelevant. Unmixing thus depends on a fine-timescale, sample-by-sample comparison of several observation signals. Humans, on the other hand, cannot hear histogram spikes¹ and yet perform well on many monaural separation tasks. We are doing structural analysis, a kind of perceptual grouping, on the incoming sound.

But what is being grouped? There is substantial evidence that the energy across time in different frequency bands can carry relatively independent information. This suggests that the appropriate subparts of an audio signal may be narrow frequency bands over short times. To generate these parts, one can perform multiband analysis: break the original signal $y(t)$ into many subband signals $b_i(t)$, each filtered to contain only energy from a small portion of the spectrum. The results of such an analysis are often displayed as a spectrogram, which shows energy (using colour or grayscale) as a function of time (abscissa) and frequency (ordinate). (For example, one is shown on the top left of figure 5.) In the musical analogy, a spectrogram is like a musical score in which the colour or grey level of each note tells you how hard to hit the piano key.

The basic idea of refiltering is to construct new sources by selectively reweighting the multiband signals $b_i(t)$. Crucially, however, the mixing coefficients are no longer constant over time; they are now called masking signals. Given a set of masking signals, denoted $\alpha_i(t)$, a source $s(t)$ can be recovered by modulating the corresponding subband signals from the original input and summing:

$$\underbrace{s(t)}_{\text{estimated source}} = \overbrace{\alpha_1(t)}^{\text{mask 1}} \underbrace{b_1(t)}_{\text{sub-band 1}} + \overbrace{\alpha_2(t)}^{\text{mask 2}} \underbrace{b_2(t)}_{\text{sub-band 2}} + \cdots + \overbrace{\alpha_K(t)}^{\text{mask }K} \underbrace{b_K(t)}_{\text{sub-band }K} \qquad (2)$$

The $\alpha_i(t)$ are gain knobs on each subband that we can twist over time to bring bands in and out of the source as needed. This performs masking on the original spectrogram. (An equivalent operation can be performed in the frequency domain.²) This approach, illustrated in figure 1, forms the basis of many CASA systems (e.g. [2,3,4]). For any specific choice of masking signals $\alpha_i(t)$, refiltering attempts to isolate a single source from the input signal and suppress all other sources and background noises. Different sources can be isolated by choosing different masking signals.

Henceforth, I will make the strong simplifying assumption that the $\alpha_i(t)$ are binary and constant over a timescale of roughly 30 ms. This is physically unrealistic, because the energy in each small region of time-frequency never comes entirely from a single source. In practice, however, for small numbers of sources this approximation works quite well (figure 3). (Think of ignoring collisions by assuming that separate piano players do not often hit the same note at the same time.)
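As a concrete illustration of equation (2), the following is a minimal sketch of refiltering, assuming a Butterworth band-pass filter bank, caller-supplied band edges, and binary masks held constant over 30 ms frames. The filter shapes and band edges are illustrative choices, not prescribed here, and in the full system the masks would come from the learned masking function.

```python
# Minimal refiltering sketch (equation 2): reweight subband signals
# b_i(t) with frame-constant binary masks alpha_i and sum the result.
# Filter bank design and band edges are illustrative assumptions.
import numpy as np
from scipy.signal import butter, lfilter

def refilter(y, masks, fs, band_edges, frame_sec=0.030):
    """Recover one source s(t) from a mono recording y(t).

    y          : 1-D array, the single microphone recording
    masks      : (n_bands, n_frames) binary array of mask values alpha_i
    band_edges : list of (low_hz, high_hz) pairs, one per subband
    """
    hop = int(frame_sec * fs)              # samples per ~30 ms frame
    s = np.zeros(len(y))
    for i, (lo, hi) in enumerate(band_edges):
        b, a = butter(4, [lo, hi], btype="bandpass", fs=fs)
        band = lfilter(b, a, y)            # subband signal b_i(t)
        # Hold each frame's mask value constant across its samples.
        alpha = np.repeat(masks[i], hop)[: len(y)]
        s[: len(alpha)] += alpha * band[: len(alpha)]
    return s
```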
¹ Try randomly permuting the time order of samples in a stereo mixture containing several sources and see if you still hear distinct streams when you play it back.

² Make a conventional spectrogram of the original signal $y(t)$ and modulate the magnitude of each short-time DFT while preserving its phase:

$$s_w(\tau) = \mathcal{F}^{-1}\left\{ \alpha_i^w \, \big\| \mathcal{F}\{y_w(\tau)\} \big\| \; \angle\, \mathcal{F}\{y_w(\tau)\} \right\}$$

where $s_w(\tau)$ and $y_w(\tau)$ are the $w$th windows (blocks) of the recovered and original signals, $\alpha_i^w$ is the masking signal for subband $i$ in window $w$, and $\mathcal{F}[\cdot]$ is the DFT.
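A minimal sketch of this frequency-domain operation, assuming scipy's STFT with an illustrative window length: because the mask is real and nonnegative, multiplying the complex short-time DFT by it scales the magnitude while leaving the phase untouched, which is exactly the modulation described in the footnote.

```python
# Frequency-domain refiltering sketch (footnote 2): modulate the
# short-time DFT magnitude, preserve its phase. For real alpha >= 0,
# alpha * Y has magnitude alpha * |Y| and the same phase angle as Y.
from scipy.signal import stft, istft

def refilter_stft(y, alpha, fs, nperseg=512):
    """alpha: (n_freq_bins, n_windows) binary spectrogram mask."""
    _, _, Y = stft(y, fs=fs, nperseg=nperseg)       # F{y_w(tau)}
    _, s = istft(alpha * Y, fs=fs, nperseg=nperseg) # inverse DFT of masked STFT
    return s
```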
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2000